<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<link rel="STYLESHEET" href="lib.css" type='text/css' />
<link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" />
<link rel='start' href='../index.html' title='Python documentation Index' />
<link rel="first" href="lib.html" title='Python library Reference' />
<link rel='contents' href='contents.html' title="Contents" />
<link rel='index' href='genindex.html' title='Index' />
<link rel='last' href='about.html' title='About this document...' />
<link rel='help' href='about.html' title='About this document...' />
<link rel="next" href="module-sgmllib.html" />
<link rel="prev" href="markup.html" />
<link rel="parent" href="markup.html" />
<link rel="next" href="htmlparser-example.html" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name='aesop' content='information' />
<title>8.1 HTMLParser -- Simple HTML and XHTML parser</title>
</head>
<body>
<div class="navigation">
<div id='top-navigation-panel' xml:id='top-navigation-panel'>
<table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td class='online-navigation'><a rel="prev" title="8. Structured Markup Processing"
  href="markup.html"><img src='../icons/previous.png'
  border='0' height='32'  alt='Previous Page' width='32' /></a></td>
<td class='online-navigation'><a rel="parent" title="8. Structured Markup Processing"
  href="markup.html"><img src='../icons/up.png'
  border='0' height='32'  alt='Up one Level' width='32' /></a></td>
<td class='online-navigation'><a rel="next" title="8.1.1 Example HTML Parser"
  href="htmlparser-example.html"><img src='../icons/next.png'
  border='0' height='32'  alt='Next Page' width='32' /></a></td>
<td align="center" width="100%">Python Library Reference</td>
<td class='online-navigation'><a rel="contents" title="Table of Contents"
  href="contents.html"><img src='../icons/contents.png'
  border='0' height='32'  alt='Contents' width='32' /></a></td>
<td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png'
  border='0' height='32'  alt='Module Index' width='32' /></a></td>
<td class='online-navigation'><a rel="index" title="Index"
  href="genindex.html"><img src='../icons/index.png'
  border='0' height='32'  alt='Index' width='32' /></a></td>
</tr></table>
<div class='online-navigation'>
<b class="navlabel">Previous:</b>
<a class="sectref" rel="prev" href="markup.html">8. Structured Markup Processing</a>
<b class="navlabel">Up:</b>
<a class="sectref" rel="parent" href="markup.html">8. Structured Markup Processing</a>
<b class="navlabel">Next:</b>
<a class="sectref" rel="next" href="htmlparser-example.html">8.1.1 Example HTML Parser</a>
</div>
<hr /></div>
</div>
<!--End of Navigation Panel-->

<h1><a name="SECTION0010100000000000000000">
8.1 <tt class="module">HTMLParser</tt> --
         Simple HTML and XHTML parser</a>
</h1>

<p>
<a name="module-HTMLParser"></a>

<p>

<span class="versionnote">New in version 2.2.</span>

<p>
This module defines a class <tt class="class">HTMLParser</tt> which serves as the
basis for parsing text files formatted in HTML<a id='l2h-1666' xml:id='l2h-1666'></a> (HyperText
Markup Language) and XHTML.<a id='l2h-1667' xml:id='l2h-1667'></a>  Unlike the parser in
<tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser is not based on the SGML parser in
<tt class="module"><a href="module-sgmllib.html">sgmllib</a></tt>.

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><span class="typelabel">class</span>&nbsp;<tt id='l2h-1650' xml:id='l2h-1650' class="class">HTMLParser</tt></b>(</nobr></td>
  <td><var></var>)</td></tr></table></dt>
<dd>
The <tt class="class">HTMLParser</tt> class is instantiated without arguments.

<p>
An <tt class="class">HTMLParser</tt> instance is fed HTML data and calls handler methods
when tags begin and end.  The <tt class="class">HTMLParser</tt> class is meant to be
subclassed, with the handler methods overridden to provide the desired behavior.

<p>
Unlike the parser in <tt class="module"><a href="module-htmllib.html">htmllib</a></tt>, this parser does not check
that end tags match start tags or call the end-tag handler for
elements which are closed implicitly by closing an outer element.
</dl>
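<p>
As a minimal sketch of this pattern, the following hypothetical subclass
(<tt class="class">EventLogger</tt> is not part of the module) records the tag events it
receives.  The import fallback is an assumption added so that the sketch also runs
on Python versions where the module is named <tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each tag event it receives."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# The &lt;li&gt; elements are never closed explicitly, so the parser
# reports no end-tag events for them.
logger.feed("&lt;ul&gt;&lt;li&gt;one&lt;li&gt;two&lt;/ul&gt;")
logger.close()
print(logger.events)
</pre></div>

<p>
Only the end tag that actually appears in the input (<code>&lt;/ul&gt;</code>) produces
an end-tag event; tracking implicitly closed elements is left to the subclass.
</p>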

<p>
An exception is defined as well:

<p>
<dl><dt><b><span class="typelabel">exception</span>&nbsp;<tt id='l2h-1651' xml:id='l2h-1651' class="exception">HTMLParseError</tt></b></dt>
<dd>
Exception raised by the <tt class="class">HTMLParser</tt> class when it encounters an
error while parsing.  This exception provides three attributes:
<tt class="member">msg</tt> is a brief message explaining the error, <tt class="member">lineno</tt>
is the number of the line on which the broken construct was detected,
and <tt class="member">offset</tt> is the number of characters into the line at which
the construct starts.
</dd></dl>

<p>
<tt class="class">HTMLParser</tt> instances have the following methods:

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1652' xml:id='l2h-1652' class="method">reset</tt></b>(</nobr></td>
  <td><var></var>)</td></tr></table></dt>
<dd>
Reset the instance, discarding all unprocessed data.  This is called
implicitly at instantiation time.
</dl>

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1653' xml:id='l2h-1653' class="method">feed</tt></b>(</nobr></td>
  <td><var>data</var>)</td></tr></table></dt>
<dd>
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or <tt class="method">close()</tt> is called.
</dl>
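<p>
The buffering behavior of <tt class="method">feed()</tt> can be sketched as follows.
<tt class="class">StartTagCollector</tt> is a hypothetical helper class, and the import
fallback is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class StartTagCollector(HTMLParser):
    """Hypothetical helper that remembers the start tags seen so far."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = StartTagCollector()

# The start tag is split across two calls.  The first call only buffers
# the incomplete text, so no handler has run yet.
parser.feed("&lt;p cla")
seen_after_first = list(parser.tags)

# The second call completes the tag and the handler is invoked.
parser.feed('ss="intro"&gt;')
seen_after_second = list(parser.tags)

parser.close()
</pre></div>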

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1654' xml:id='l2h-1654' class="method">close</tt></b>(</nobr></td>
  <td><var></var>)</td></tr></table></dt>
<dd>
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the <tt class="class">HTMLParser</tt> base class
method <tt class="method">close()</tt>.
</dl>
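<p>
A redefined <tt class="method">close()</tt> might look like the sketch below.  The class
name <tt class="class">TextCollector</tt> and its attributes are hypothetical; the base
class method is called explicitly by name, and the import fallback is an assumption
for Python versions that name the module <tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class TextCollector(HTMLParser):
    """Hypothetical subclass that assembles its result when closed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None  # filled in by close()

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class process its buffered data first ...
        HTMLParser.close(self)
        # ... then do the additional end-of-input processing.
        self.text = "".join(self.chunks)


collector = TextCollector()
collector.feed("&lt;p&gt;Hello, ")
collector.feed("world&lt;/p&gt;")
collector.close()
print(collector.text)
</pre></div>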

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1655' xml:id='l2h-1655' class="method">getpos</tt></b>(</nobr></td>
  <td><var></var>)</td></tr></table></dt>
<dd>
Return the current line number and offset as a
<code>(<var>lineno</var>, <var>offset</var>)</code> pair.
</dl>
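<p>
One possible use of <tt class="method">getpos()</tt> is to record where each start tag
begins.  In this sketch <tt class="class">PositionLogger</tt> is a hypothetical class; it
relies on line numbers being counted from 1 and offsets from 0, and the import
fallback is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class PositionLogger(HTMLParser):
    """Hypothetical subclass that notes the position of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        lineno, offset = self.getpos()
        self.positions.append((tag, lineno, offset))


logger = PositionLogger()
# The second start tag sits on line 2, indented by two spaces.
logger.feed("&lt;p&gt;\n  &lt;b&gt;bold&lt;/b&gt;&lt;/p&gt;")
logger.close()
print(logger.positions)
</pre></div>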

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1656' xml:id='l2h-1656' class="method">get_starttag_text</tt></b>(</nobr></td>
  <td><var></var>)</td></tr></table></dt>
<dd>
Return the text of the most recently opened start tag.  This should
not normally be needed for structured processing, but may be useful in
dealing with HTML "as deployed" or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
</dl>
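<p>
The difference between the normalized arguments and the original source text can
be sketched like this.  <tt class="class">RawStartTags</tt> is a hypothetical class, and
the import fallback is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class RawStartTags(HTMLParser):
    """Hypothetical subclass pairing each tag name with its source text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() keeps the original
        # case and the whitespace between attributes.
        self.raw.append((tag, self.get_starttag_text()))


parser = RawStartTags()
parser.feed('&lt;A   HREF="http://www.cwi.nl/"&gt;CWI&lt;/A&gt;')
parser.close()
print(parser.raw)
</pre></div>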

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1657' xml:id='l2h-1657' class="method">handle_starttag</tt></b>(</nobr></td>
  <td><var>tag, attrs</var>)</td></tr></table></dt>
<dd> 
This method is called to handle the start of a tag.  It is intended to
be overridden by a derived class; the base class implementation does
nothing.  

<p>
The <var>tag</var> argument is the name of the tag converted to
lower case.  The <var>attrs</var> argument is a list of <code>(<var>name</var>,
<var>value</var>)</code> pairs containing the attributes found inside the tag's
<code>&lt;&gt;</code> brackets.  The <var>name</var> is converted to lower case,
and double quotes and backslashes in the <var>value</var> are
interpreted.  For instance, for the tag <code>&lt;A
HREF="http://www.cwi.nl/"&gt;</code>, this method would be called as
"<tt class="samp">handle_starttag('a', [('href', 'http://www.cwi.nl/')])</tt>".
</dl>
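<p>
A typical override walks the <var>attrs</var> list to pick out the attributes of
interest.  The sketch below collects link targets; <tt class="class">LinkCollector</tt> is
a hypothetical class, and the import fallback is an assumption for Python versions
that name the module <tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every &lt;a&gt; tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with lower-cased names.
        for name, value in attrs:
            if name == "href":
                self.links.append(value)


collector = LinkCollector()
collector.feed('&lt;p&gt;&lt;A HREF="http://www.cwi.nl/"&gt;CWI&lt;/A&gt; and '
               '&lt;a name="top"&gt;an anchor&lt;/a&gt;&lt;/p&gt;')
collector.close()
print(collector.links)
</pre></div>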

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1658' xml:id='l2h-1658' class="method">handle_startendtag</tt></b>(</nobr></td>
  <td><var>tag, attrs</var>)</td></tr></table></dt>
<dd>
Similar to <tt class="method">handle_starttag()</tt>, but called when the parser
encounters an XHTML-style empty tag (<code>&lt;a .../&gt;</code>).  This method
may be overridden by subclasses which require this particular lexical
information; the default implementation simply calls
<tt class="method">handle_starttag()</tt> and <tt class="method">handle_endtag()</tt>.
</dl>
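<p>
A subclass that needs to know which tags were written in the empty-tag form can
override this method and still delegate to the default behavior, roughly as
sketched here.  <tt class="class">EmptyTagTracker</tt> is a hypothetical class, and the
import fallback is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class EmptyTagTracker(HTMLParser):
    """Hypothetical subclass that notes XHTML-style empty tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.empty_tags = []
        self.events = []

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # Delegate to the default implementation so that the start-tag
        # and end-tag handlers below still run.
        HTMLParser.handle_startendtag(self, tag, attrs)

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


tracker = EmptyTagTracker()
tracker.feed("&lt;p&gt;line one&lt;br /&gt;line two&lt;/p&gt;")
tracker.close()
print(tracker.empty_tags)
print(tracker.events)
</pre></div>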

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1659' xml:id='l2h-1659' class="method">handle_endtag</tt></b>(</nobr></td>
  <td><var>tag</var>)</td></tr></table></dt>
<dd>
This method is called to handle the end tag of an element.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.  The <var>tag</var> argument is the name of
the tag converted to lower case.
</dl>
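<p>
Because the parser itself does not check that end tags match start tags, a
subclass that cares about nesting has to keep its own stack.  The following is one
possible sketch, not a complete validator: <tt class="class">NestingChecker</tt> is a
hypothetical class, it ignores elements that are legitimately left unclosed, and
the import fallback is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class NestingChecker(HTMLParser):
    """Hypothetical subclass that reports end tags with no matching start."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # The end tag does not close the innermost open element.
            self.mismatches.append(tag)


checker = NestingChecker()
checker.feed("&lt;div&gt;&lt;b&gt;text&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;")
checker.close()
print(checker.mismatches)
</pre></div>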

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1660' xml:id='l2h-1660' class="method">handle_data</tt></b>(</nobr></td>
  <td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
</dl>
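<p>
A common use of this handler is to extract the text of a document.  The sketch
below joins all pieces before normalizing whitespace, since it does not assume that
a run of text arrives in a single call.  <tt class="class">TextExtractor</tt> and
<tt class="method">get_text()</tt> are hypothetical names, and the import fallback is an
assumption for Python versions that name the module <tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class TextExtractor(HTMLParser):
    """Hypothetical subclass that keeps only the character data."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def get_text(self):
        # Join the pieces first, then collapse runs of whitespace.
        return " ".join("".join(self.pieces).split())


extractor = TextExtractor()
extractor.feed("&lt;h1&gt;Title&lt;/h1&gt;\n&lt;p&gt;Some &lt;em&gt;emphasised&lt;/em&gt;   text.&lt;/p&gt;")
extractor.close()
print(extractor.get_text())
</pre></div>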

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1661' xml:id='l2h-1661' class="method">handle_charref</tt></b>(</nobr></td>
  <td><var>name</var>)</td></tr></table></dt>
<dd> This method is called to
process a character reference of the form "<tt class="samp">&amp;#<var>ref</var>;</tt>",
with <var>ref</var> passed as the <var>name</var> argument.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.  
</dl>

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1662' xml:id='l2h-1662' class="method">handle_entityref</tt></b>(</nobr></td>
  <td><var>name</var>)</td></tr></table></dt>
<dd> 
This method is called to process a general entity reference of the
form "<tt class="samp">&amp;<var>name</var>;</tt>" where <var>name</var> is the name of a general
entity.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
</dl>
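<p>
The two reference handlers, <tt class="method">handle_charref()</tt> and
<tt class="method">handle_entityref()</tt>, are often overridden together to turn
references back into text.  The sketch below makes several assumptions: the
<code>ENTITIES</code> table and the <code>charref_to_text()</code> helper are
hypothetical, only a few entity names are covered, numeric references are taken to
be decimal or hexadecimal with an <code>x</code> prefix, and the code points are
assumed to be small enough for <code>chr()</code>.  The import fallback is likewise
an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3

# Hypothetical table covering only a few common entity names.
ENTITIES = {"amp": "&amp;", "lt": "&lt;", "gt": "&gt;", "quot": '"'}


def charref_to_text(name):
    """Convert the name given to handle_charref() into a character."""
    if name[:1] in ("x", "X"):
        return chr(int(name[1:], 16))   # hexadecimal form
    return chr(int(name))               # decimal form


class ReferenceDecoder(HTMLParser):
    """Hypothetical subclass that replaces references by their text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        # Some later versions of the module decode references
        # themselves and deliver the result here instead.
        self.pieces.append(data)

    def handle_charref(self, name):
        self.pieces.append(charref_to_text(name))

    def handle_entityref(self, name):
        # Unknown entity names are put back unchanged.
        self.pieces.append(ENTITIES.get(name, "&amp;" + name + ";"))


decoder = ReferenceDecoder()
decoder.feed("&amp;#65;&amp;#x42; &amp;amp; C &amp;lt; D")
decoder.close()
text = "".join(decoder.pieces)
print(text)
</pre></div>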

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1663' xml:id='l2h-1663' class="method">handle_comment</tt></b>(</nobr></td>
  <td><var>data</var>)</td></tr></table></dt>
<dd>
This method is called when a comment is encountered.  The
<var>data</var> argument is a string containing the text between the
"<tt class="samp">--</tt>" and "<tt class="samp">--</tt>" delimiters, but not the delimiters
themselves.  For example, the comment "<tt class="samp">&lt;!--text--&gt;</tt>" will
cause this method to be called with the argument <code>'text'</code>.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
</dl>
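<p>
As an illustration, the following sketch gathers the text of every comment.
<tt class="class">CommentCollector</tt> is a hypothetical class, and the import fallback
is an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the delimiters but keeps inner whitespace.
        self.comments.append(data)


collector = CommentCollector()
collector.feed("&lt;p&gt;visible&lt;/p&gt;&lt;!--text--&gt;&lt;!-- spaced --&gt;")
collector.close()
print(collector.comments)
</pre></div>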

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1664' xml:id='l2h-1664' class="method">handle_decl</tt></b>(</nobr></td>
  <td><var>decl</var>)</td></tr></table></dt>
<dd>
Method called when an SGML declaration is read by the parser.  The
<var>decl</var> parameter will be the entire contents of the declaration
inside the <code>&lt;!</code>...<code>&gt;</code> markup.  It is intended to be overridden
by a derived class; the base class implementation does nothing.
</dl>
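<p>
For a document type declaration, the handler might be used as in this sketch.
<tt class="class">DeclCollector</tt> is a hypothetical class, and the import fallback is
an assumption for Python versions that name the module
<tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class DeclCollector(HTMLParser):
    """Hypothetical subclass that stores each declaration it is given."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        # decl is the text between the opening "&lt;!" and the closing "&gt;".
        self.decls.append(decl)


collector = DeclCollector()
collector.feed('&lt;!DOCTYPE html PUBLIC '
               '"-//W3C//DTD HTML 4.0 Transitional//EN"&gt;'
               '&lt;html&gt;&lt;/html&gt;')
collector.close()
print(collector.decls)
</pre></div>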

<p>
<dl><dt><table cellpadding="0" cellspacing="0"><tr valign="baseline">
  <td><nobr><b><tt id='l2h-1665' xml:id='l2h-1665' class="method">handle_pi</tt></b>(</nobr></td>
  <td><var>data</var>)</td></tr></table></dt>
<dd>
Method called when a processing instruction is encountered.  The
<var>data</var> parameter will contain the entire processing instruction.
For example, for the processing instruction <code>&lt;?proc color='red'&gt;</code>,
this method would be called as <code>handle_pi("proc color='red'")</code>.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.

<p>
<span class="note"><b class="label">Note:</b>
The <tt class="class">HTMLParser</tt> class uses the SGML syntactic rules for
processing instructions.  An XHTML processing instruction using the
trailing "<tt class="character">?</tt>" will cause the "<tt class="character">?</tt>" to be included in
<var>data</var>.</span>
</dl>
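<p>
The effect described in the note can be seen by feeding both forms of processing
instruction to a small subclass.  <tt class="class">PICollector</tt> is a hypothetical
class, and the import fallback is an assumption for Python versions that name the
module <tt class="module">html.parser</tt>.
</p>

<div class="verbatim"><pre>
try:
    from HTMLParser import HTMLParser   # module name as documented here
except ImportError:
    from html.parser import HTMLParser  # name of the module in Python 3


class PICollector(HTMLParser):
    """Hypothetical subclass that stores each processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
# SGML-style instruction, then an XHTML-style one with a trailing "?".
collector.feed("&lt;?proc color='red'&gt;")
collector.feed('&lt;?xml version="1.0"?&gt;')
collector.close()
print(collector.instructions)
</pre></div>

<p>
For the second instruction the trailing "<tt class="character">?</tt>" remains part of
the stored string, so a subclass that handles XHTML input may want to strip it.
</p>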

<p>

<p><br /></p><hr class='online-navigation' />
<div class='online-navigation'>
<!--Table of Child-Links-->
<a name="CHILD_LINKS"><strong>Subsections</strong></a>

<ul class="ChildLinks">
<li><a href="htmlparser-example.html">8.1.1 Example HTML Parser Application</a>
</ul>
<!--End of Table of Child-Links-->
</div>

<div class="navigation">
<hr />
<span class="release-info">Release 2.5.1, documentation updated on 18th April, 2007.</span>
</div>
<!--End of Navigation Panel-->
<address>
See <i><a href="about.html">About this document...</a></i> for information on suggesting changes.
</address>
</body>
</html>
